BY:
- **MOHAMMAD ALSHAYE** : 202065240
# this corresponds to (a) under the data preparation section of the project guideline
# reading and listing the variables, then identifying their types
# 1. Reading & displaying the data
import pandas as pd
df = pd.read_csv('chip_dataset.csv', delimiter = ',')
df.info()  # info() prints its summary directly, so wrapping it in display() only shows None
print("-" * 100)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4854 entries, 0 to 4853
Data columns (total 14 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   Unnamed: 0             4854 non-null   int64
 1   Product                4854 non-null   object
 2   Type                   4854 non-null   object
 3   Release Date           4818 non-null   object
 4   Process Size (nm)      4845 non-null   float64
 5   TDP (W)                4228 non-null   float64
 6   Die Size (mm^2)        4139 non-null   float64
 7   Transistors (million)  4143 non-null   float64
 8   Freq (MHz)             4854 non-null   int64
 9   Foundry                4854 non-null   object
 10  Vendor                 4854 non-null   object
 11  FP16 GFLOPS            536 non-null    float64
 12  FP32 GFLOPS            1948 non-null   float64
 13  FP64 GFLOPS            1306 non-null   float64
dtypes: float64(7), int64(2), object(5)
memory usage: 531.0+ KB
----------------------------------------------------------------------------------------------------
Identifying the fields of the data.
From the above table, we can see the following:
Number of rows is 4854.
Number of columns is 14.
The fields are: Unnamed, Product, Type, Release Date, Process Size (nm), TDP (W), Die Size (mm^2), Transistors (million), Freq (MHz), Foundry, Vendor, FP16 GFLOPS, FP32 GFLOPS, and FP64 GFLOPS.
Each field's type is identified based on its values, along with its Python datatype. The following table summarizes the required information:
| Field | Type | Description |
|---|---|---|
| Unnamed | Numeric | Index |
| Product | Categorical | Product |
| Type | Categorical | CPU or GPU |
| Release Date | Categorical | Release Date |
| Process Size (nm) | Numeric | Process Size in nanometers |
| TDP (W) | Numeric | Thermal Design Power in Watts |
| Die Size (mm^2) | Numeric | Die Size in squared millimeters |
| Transistors (million) | Numeric | Transistors in millions |
| Freq (MHz) | Numeric | Frequency in megahertz |
| Foundry | Categorical | the company that manufactured the chip |
| Vendor | Categorical | the company that designed the chip |
| FP16 GFLOPS | Numeric | Giga floating-point operations per second using half-precision (FP16) arithmetic |
| FP32 GFLOPS | Numeric | Giga floating-point operations per second using single-precision (FP32) arithmetic |
| FP64 GFLOPS | Numeric | Giga floating-point operations per second using double-precision (FP64) arithmetic |
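The numeric/categorical split in this table can be checked programmatically with `select_dtypes`; a minimal sketch on a stand-in frame (the real `df` is read from `chip_dataset.csv`):

```python
import pandas as pd

# Stand-in frame mimicking a few columns of the chip dataset
df = pd.DataFrame({
    "Product": ["AMD Athlon 64 3500+", "Intel GMA 950"],
    "Freq (MHz)": [2200, 250],
    "TDP (W)": [45.0, 7.0],
})

numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(include="object").columns.tolist()
print(numeric_cols)      # ['Freq (MHz)', 'TDP (W)']
print(categorical_cols)  # ['Product']
```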
# this corresponds to (a) under the data preparation section of the project guideline
# the first column (Unnamed: 0) should be dropped as it is not needed
df.drop("Unnamed: 0",inplace=True, axis=1)
# identifying the percentage of NaN values in each column
numberOfRows = len(df.index)
display(round(df.isna().sum()/numberOfRows*100))
Product                   0.0
Type                      0.0
Release Date              1.0
Process Size (nm)         0.0
TDP (W)                  13.0
Die Size (mm^2)          15.0
Transistors (million)    15.0
Freq (MHz)                0.0
Foundry                   0.0
Vendor                    0.0
FP16 GFLOPS              89.0
FP32 GFLOPS              60.0
FP64 GFLOPS              73.0
dtype: float64
import numpy as np
# Since FP16 GFLOPS is about 89% NaN, it is better to drop the column.
df.drop("FP16 GFLOPS",inplace=True, axis=1)
# Fixing inconsistencies in Release Date
# There are two problems with the Release Date column: the "NaT" strings and the NaN values
# Replace the string "NaT" values with np.nan so the fillna step below can impute them
df["Release Date"] = df["Release Date"].replace("NaT", np.nan)
# Impute Release Date column with the mode
mode = df["Release Date"].mode()[0] # get the mode value
df["Release Date"].fillna(mode, inplace=True)
# Impute numerical columns with the mean
numeric_cols = df.select_dtypes(include=np.number).columns # get the numerical columns
for col in numeric_cols:
    col_mean = df[col].mean()
    df[col].fillna(col_mean, inplace=True)
    df[col] = df[col].round()  # round to the nearest integer
display(df.head())
print("The number of Nan values:")
print(df.isna().sum())
| | Product | Type | Release Date | Process Size (nm) | TDP (W) | Die Size (mm^2) | Transistors (million) | Freq (MHz) | Foundry | Vendor | FP32 GFLOPS | FP64 GFLOPS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AMD Athlon 64 3500+ | CPU | 2/20/2007 | 65.0 | 45.0 | 77.0 | 122.0 | 2200 | Unknown | AMD | 2135.0 | 364.0 |
| 1 | AMD Athlon 200GE | CPU | 9/6/2018 | 14.0 | 35.0 | 192.0 | 4800.0 | 3200 | Unknown | AMD | 2135.0 | 364.0 |
| 2 | Intel Core i5-1145G7 | CPU | 9/2/2020 | 10.0 | 28.0 | 188.0 | 1930.0 | 2600 | Intel | Intel | 2135.0 | 364.0 |
| 3 | Intel Xeon E5-2603 v2 | CPU | 9/1/2013 | 22.0 | 80.0 | 160.0 | 1400.0 | 1800 | Intel | Intel | 2135.0 | 364.0 |
| 4 | AMD Phenom II X4 980 BE | CPU | 5/3/2011 | 45.0 | 125.0 | 258.0 | 758.0 | 3700 | Unknown | AMD | 2135.0 | 364.0 |
The number of Nan values:
Product                  0
Type                     0
Release Date             0
Process Size (nm)        0
TDP (W)                  0
Die Size (mm^2)          0
Transistors (million)    0
Freq (MHz)               0
Foundry                  0
Vendor                   0
FP32 GFLOPS              0
FP64 GFLOPS              0
dtype: int64
# We only need the year in our analysis, so the day and month should be dropped
df["Release Date"] = df["Release Date"].apply(lambda x: int(x[-4:]))
import matplotlib.pyplot as plt
import seaborn as sns
# Selecting numerical columns
boxplot_cols = df.select_dtypes(include="number").columns.drop("Release Date")
# Setting up plot
n_cols = 2
n_rows = len(boxplot_cols) // n_cols + 1
fig, axes = plt.subplots(n_rows, n_cols, figsize=(25, 25))
print("Before removing the outliers")
# Creating boxplots
for i, col in enumerate(boxplot_cols):
    axis = axes.flatten()[i]
    sns.boxplot(data=df, x='Release Date', y=col, ax=axis)
plt.tight_layout()
plt.show()
# Removing outliers
for c in boxplot_cols:
    # Remove rows in the top and bottom 1% of each column
    q3, q1 = np.percentile(df[c], [99, 1])
    outlier_mask = (df[c] <= q1) | (df[c] >= q3)
    outlier_indexes = df[outlier_mask].index  # use the mask to get the row indexes
    df.drop(outlier_indexes, axis=0, inplace=True)
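The cell above trims the top and bottom 1% of each column; the classic Tukey IQR rule is a common alternative, sketched here on toy data:

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 11, 100])  # 100 is an obvious outlier
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
# Keep only values within 1.5 * IQR of the quartiles
mask = (s >= q1 - 1.5 * iqr) & (s <= q3 + 1.5 * iqr)
print(s[mask].tolist())  # the 100 is dropped
```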
# Re-plotting the same numerical columns after removing the outliers
# Setting up plot
n_cols = 2
n_rows = len(boxplot_cols) // n_cols + 1
fig, axes = plt.subplots(n_rows, n_cols, figsize=(25, 25))
print("After removing the outliers")
# Creating boxplots
for i, col in enumerate(boxplot_cols):
    axis = axes.flatten()[i]
    sns.boxplot(data=df, x='Release Date', y=col, ax=axis)
plt.tight_layout()
plt.show()
Before removing the outliers
After removing the outliers
#The higher the frequency, the higher the performance
#Creating boolean masks for CPU and GPU
CPU = df["Type"] == "CPU"
GPU = df["Type"] == "GPU"
# Defining a categorizer factory: thresholds are the 33rd and 66th percentiles
# of Freq (MHz), computed separately for CPUs and GPUs via the mask
def make_performance_categorizer(mask):
    th1, th2 = np.percentile(df.loc[mask, "Freq (MHz)"], [33, 66])
    def categorize(x):
        if x < th1:
            return "Low"
        elif x < th2:
            return "Medium"
        else:
            return "High"
    return categorize

performanceCPU = make_performance_categorizer(CPU)
performanceGPU = make_performance_categorizer(GPU)
#Creating a new column to hold the performance category for all processors
df["Performance"] = np.nan
#Filling the performance category column with values for CPUs
df.loc[CPU,"Performance"] = df.loc[CPU,"Freq (MHz)"].apply(performanceCPU)
#Filling the performance category column with values for GPUs
df.loc[GPU,"Performance"] = df.loc[GPU,"Freq (MHz)"].apply(performanceGPU)
display(df)
| | Product | Type | Release Date | Process Size (nm) | TDP (W) | Die Size (mm^2) | Transistors (million) | Freq (MHz) | Foundry | Vendor | FP32 GFLOPS | FP64 GFLOPS | Performance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AMD Athlon 64 3500+ | CPU | 2007 | 65.0 | 45.0 | 77.0 | 122.0 | 2200 | Unknown | AMD | 2135.0 | 364.0 | Medium |
| 1 | AMD Athlon 200GE | CPU | 2018 | 14.0 | 35.0 | 192.0 | 4800.0 | 3200 | Unknown | AMD | 2135.0 | 364.0 | High |
| 2 | Intel Core i5-1145G7 | CPU | 2020 | 10.0 | 28.0 | 188.0 | 1930.0 | 2600 | Intel | Intel | 2135.0 | 364.0 | Medium |
| 3 | Intel Xeon E5-2603 v2 | CPU | 2013 | 22.0 | 80.0 | 160.0 | 1400.0 | 1800 | Intel | Intel | 2135.0 | 364.0 | Low |
| 4 | AMD Phenom II X4 980 BE | CPU | 2011 | 45.0 | 125.0 | 258.0 | 758.0 | 3700 | Unknown | AMD | 2135.0 | 364.0 | High |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4844 | ATI FirePro V7800 | GPU | 2010 | 40.0 | 150.0 | 334.0 | 2154.0 | 700 | TSMC | ATI | 2016.0 | 403.0 | Medium |
| 4846 | NVIDIA Playstation 3 GPU 28nm | GPU | 2013 | 28.0 | 21.0 | 68.0 | 302.0 | 550 | Sony | NVIDIA | 2135.0 | 364.0 | Medium |
| 4849 | NVIDIA Quadro 3000M | GPU | 2011 | 40.0 | 75.0 | 332.0 | 1950.0 | 450 | TSMC | NVIDIA | 432.0 | 36.0 | Low |
| 4850 | Intel GMA 950 | GPU | 2005 | 90.0 | 7.0 | 188.0 | 1930.0 | 250 | Intel | Intel | 2135.0 | 364.0 | Low |
| 4851 | NVIDIA GeForce GT 320M | GPU | 2010 | 40.0 | 23.0 | 100.0 | 486.0 | 500 | TSMC | NVIDIA | 53.0 | 364.0 | Medium |
3932 rows × 13 columns
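The two percentile-based categorizers above can also be written with `pd.qcut`, which bins a series by quantiles directly; a sketch on toy frequencies (edge handling at the exact cut points differs slightly from the if/elif version, and the real data would apply this per Type group):

```python
import pandas as pd

freq = pd.Series([500, 800, 1200, 2000, 2600, 3400])  # toy Freq (MHz) values
# Cut at the 33rd and 66th percentiles into three labeled tiers
tiers = pd.qcut(freq, q=[0, 0.33, 0.66, 1.0], labels=["Low", "Medium", "High"])
print(tiers.tolist())  # ['Low', 'Low', 'Medium', 'Medium', 'High', 'High']
```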
# Save the cleaned data
df.to_csv('chip_dataset_cleaned.csv', index = False, header=True)
df = pd.read_csv('chip_dataset_cleaned.csv')
display(df.head())
display(df.describe())
display(df.select_dtypes(include='object').describe())
| | Product | Type | Release Date | Process Size (nm) | TDP (W) | Die Size (mm^2) | Transistors (million) | Freq (MHz) | Foundry | Vendor | FP32 GFLOPS | FP64 GFLOPS | Performance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AMD Athlon 64 3500+ | CPU | 2007 | 65.0 | 45.0 | 77.0 | 122.0 | 2200 | Unknown | AMD | 2135.0 | 364.0 | Medium |
| 1 | AMD Athlon 200GE | CPU | 2018 | 14.0 | 35.0 | 192.0 | 4800.0 | 3200 | Unknown | AMD | 2135.0 | 364.0 | High |
| 2 | Intel Core i5-1145G7 | CPU | 2020 | 10.0 | 28.0 | 188.0 | 1930.0 | 2600 | Intel | Intel | 2135.0 | 364.0 | Medium |
| 3 | Intel Xeon E5-2603 v2 | CPU | 2013 | 22.0 | 80.0 | 160.0 | 1400.0 | 1800 | Intel | Intel | 2135.0 | 364.0 | Low |
| 4 | AMD Phenom II X4 980 BE | CPU | 2011 | 45.0 | 125.0 | 258.0 | 758.0 | 3700 | Unknown | AMD | 2135.0 | 364.0 | High |
| | Release Date | Process Size (nm) | TDP (W) | Die Size (mm^2) | Transistors (million) | Freq (MHz) | FP32 GFLOPS | FP64 GFLOPS |
|---|---|---|---|---|---|---|---|---|
| count | 3932.000000 | 3932.000000 | 3932.000000 | 3932.000000 | 3932.000000 | 3932.000000 | 3932.000000 | 3932.000000 |
| mean | 2010.478891 | 54.936165 | 70.707782 | 178.016277 | 1308.163021 | 1538.046033 | 1727.756104 | 303.717701 |
| std | 5.067623 | 39.522290 | 47.181074 | 84.741565 | 1569.351396 | 1049.048905 | 905.849215 | 119.798022 |
| min | 2001.000000 | 8.000000 | 5.000000 | 57.000000 | 36.000000 | 220.000000 | 30.000000 | 13.000000 |
| 25% | 2006.000000 | 28.000000 | 35.000000 | 118.000000 | 181.000000 | 600.000000 | 1009.000000 | 364.000000 |
| 50% | 2011.000000 | 40.000000 | 65.000000 | 177.000000 | 754.000000 | 1124.500000 | 2135.000000 | 364.000000 |
| 75% | 2014.000000 | 90.000000 | 86.000000 | 215.000000 | 1930.000000 | 2400.000000 | 2135.000000 | 364.000000 |
| max | 2021.000000 | 150.000000 | 294.000000 | 529.000000 | 10800.000000 | 3800.000000 | 5990.000000 | 529.000000 |
| | Product | Type | Foundry | Vendor | Performance |
|---|---|---|---|---|---|
| count | 3932 | 3932 | 3932 | 3932 | 3932 |
| unique | 3447 | 2 | 9 | 5 | 3 |
| top | AMD Athlon 64 3200+ | GPU | TSMC | AMD | High |
| freq | 12 | 2019 | 1593 | 1314 | 1401 |
# Graphs Based on Single Numeric Variable
num_columns = df.select_dtypes(exclude='object').columns
fig,axes = plt.subplots(3, 3, figsize=(25,25))
for ind, cl in enumerate(num_columns):
    sns.histplot(x=cl, bins=10, kde=True, data=df, ax=axes.flatten()[ind])
plt.show()
# Graphs Based on Single Non-Numeric Variable
cat_columns = df.select_dtypes('object').columns.drop('Product')
fig,axes = plt.subplots(2, 2, figsize=(25,25))
for ind, col in enumerate(cat_columns):
    axis = axes.flatten()[ind]
    sns.countplot(x=col, data=df, ax=axis)
    axis.set_xticklabels(axis.get_xticklabels(), rotation=80)
plt.show()
# Graphs Based on Two Numeric Variables
num_columns = df.select_dtypes(exclude='object').columns
# pairplot creates its own figure, so no plt.figure call is needed
sns.pairplot(vars=num_columns, hue='Performance', data=df)
plt.show()
# Graphs Based on Two Non-Numeric Variables
cat_columns = df.select_dtypes('object').columns.drop(labels=['Product', 'Performance'])
fig, axes = plt.subplots(2, 2, figsize=(25, 25))
for ind, col in enumerate(cat_columns):
    axis = axes.flatten()[ind]
    sns.countplot(x=col, hue="Performance", data=df, ax=axis)
plt.show()
# Graphs Based on More than Two Numeric Variables
plt.figure(figsize=(20,10))
sns.scatterplot(x='Release Date', y='Process Size (nm)', color='red', label='Process Size (nm) Score', alpha=0.5, data=df)
sns.scatterplot(x='Release Date', y='TDP (W)',color='blue',label='TDP (W) Score', alpha=0.5, data=df)
sns.scatterplot(x='Release Date', y='FP32 GFLOPS',color='green',label='FP32 GFLOPS Score', alpha=0.5, data=df)
sns.scatterplot(x='Release Date', y='FP64 GFLOPS',color='yellow',label='FP64 GFLOPS Score', alpha=0.5, data=df)
plt.ylabel("Properties");
plt.xlabel("Release Dates");
plt.title("Comparison between Process Size (nm) & TDP (W) & FP32 GFLOPS & FP64 GFLOPS");
plt.show()
# Graphs Based on More than Two Numeric Variables
plt.figure(figsize=(20,10))
sns.scatterplot(x='Release Date', y='Die Size (mm^2)', color='red', label='Die Size (mm^2) Score', alpha=0.5, data=df)
sns.scatterplot(x='Release Date', y='Transistors (million)',color='blue',label='Transistors (million) Score', alpha=0.5, data=df)
sns.scatterplot(x='Release Date', y='Freq (MHz)',color='green',label='Freq (MHz) Score', alpha=0.5, data=df)
plt.ylabel("Properties");
plt.xlabel("Release Dates");
plt.title("Comparison between Die Size (mm^2) & Transistors (million) & Freq (MHz)");
plt.show()
corr = df.corr(method='pearson', numeric_only=True)  # numeric_only skips the text columns
f,ax = plt.subplots(figsize=(25, 25))
sns.heatmap(corr,cbar=True,square=True,fmt='.1f',annot=True,annot_kws={'size':15},cmap="Reds")
plt.title('Relationship between Variables', fontsize = 20)
plt.show()
print("-----------------------------------------------------------")
print("The most positive correlation is between Release Date and Transistors (million)")
print("-----------------------------------------------------------")
print("The most negative correlation is between Release Date and Process Size (nm)")
print("-----------------------------------------------------------")
-----------------------------------------------------------
The most positive correlation is between Release Date and Transistors (million)
-----------------------------------------------------------
The most negative correlation is between Release Date and Process Size (nm)
-----------------------------------------------------------
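The two correlations called out above can be extracted from the matrix programmatically instead of being read off the heatmap; a sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],   # perfectly positively correlated with a
    "c": [8, 6, 4, 2],   # perfectly negatively correlated with a
})
corr = df.corr()
pairs = corr.stack()  # flatten the matrix into (row, column) -> value
# Drop the trivial self-correlations on the diagonal
pairs = pairs[pairs.index.get_level_values(0) != pairs.index.get_level_values(1)]
print(pairs.idxmax(), pairs.idxmin())  # ('a', 'b') ('a', 'c')
```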
# Graphs Showing the Relations Between Variables
cat_columns = df.select_dtypes('object').columns.drop(labels=['Product', 'Performance'])
num_columns = df.select_dtypes(exclude='object').columns
fig, axes = plt.subplots(len(cat_columns), len(num_columns), figsize=(40, 30))
for c, nCol in enumerate(num_columns):
    for r, cCol in enumerate(cat_columns):
        axis = axes[r][c]
        sns.boxplot(x=cCol, y=nCol, hue='Performance', data=df, ax=axis)
        axis.set_xticklabels(axis.get_xticklabels(), rotation=80)
plt.show()
plt.show()
ctab2 = pd.crosstab([df['Vendor'], df['Foundry'], df['Type']], df['Performance'], margins = True, margins_name = "Total")
display(ctab2)
plt.figure(figsize=(20,20))
sns.heatmap(pd.crosstab([df['Vendor'], df['Foundry'], df['Type']], df['Performance']), cmap = "YlGnBu",annot = True)
plt.show()
print("------------------------------------------------------------------------------")
print("As we can see from the graph that most of the greatest performing CPUs are produced by Intel")
print("------------------------------------------------------------------------------")
print("As we can see from the graph that most of the lowest performing CPUs are produced by AMD")
print("------------------------------------------------------------------------------")
| Vendor | Foundry | Type | High | Low | Medium | Total |
|---|---|---|---|---|---|---|
| AMD | GF | CPU | 52 | 18 | 19 | 89 |
| AMD | GF | GPU | 92 | 37 | 7 | 136 |
| AMD | Renesas | GPU | 0 | 0 | 1 | 1 |
| AMD | TSMC | GPU | 211 | 34 | 130 | 375 |
| AMD | Unknown | CPU | 164 | 313 | 235 | 712 |
| AMD | Unknown | GPU | 0 | 1 | 0 | 1 |
| ATI | NEC | GPU | 0 | 1 | 0 | 1 |
| ATI | TSMC | GPU | 50 | 164 | 208 | 422 |
| ATI | UMC | GPU | 0 | 21 | 7 | 28 |
| ATI | Unknown | GPU | 0 | 27 | 11 | 38 |
| Intel | Intel | CPU | 490 | 300 | 322 | 1112 |
| Intel | Intel | GPU | 6 | 79 | 21 | 106 |
| NVIDIA | Samsung | GPU | 33 | 0 | 1 | 34 |
| NVIDIA | Sony | GPU | 0 | 0 | 4 | 4 |
| NVIDIA | TSMC | GPU | 303 | 206 | 287 | 796 |
| NVIDIA | UMC | GPU | 0 | 1 | 7 | 8 |
| NVIDIA | Unknown | GPU | 0 | 35 | 14 | 49 |
| Other | UMC | GPU | 0 | 20 | 0 | 20 |
| Total | | | 1401 | 1257 | 1274 | 3932 |
------------------------------------------------------------------------------
As the graph shows, most of the best performing CPUs are produced by Intel
------------------------------------------------------------------------------
As the graph shows, most of the lowest performing CPUs are produced by AMD
------------------------------------------------------------------------------
# Import the needed libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
# Reading the data file and displaying it
df = pd.read_csv('chip_dataset_cleaned.csv')
display(df)
display(df.describe().T)
| | Product | Type | Release Date | Process Size (nm) | TDP (W) | Die Size (mm^2) | Transistors (million) | Freq (MHz) | Foundry | Vendor | FP32 GFLOPS | FP64 GFLOPS | Performance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AMD Athlon 64 3500+ | CPU | 2007 | 65.0 | 45.0 | 77.0 | 122.0 | 2200 | Unknown | AMD | 2135.0 | 364.0 | Medium |
| 1 | AMD Athlon 200GE | CPU | 2018 | 14.0 | 35.0 | 192.0 | 4800.0 | 3200 | Unknown | AMD | 2135.0 | 364.0 | High |
| 2 | Intel Core i5-1145G7 | CPU | 2020 | 10.0 | 28.0 | 188.0 | 1930.0 | 2600 | Intel | Intel | 2135.0 | 364.0 | Medium |
| 3 | Intel Xeon E5-2603 v2 | CPU | 2013 | 22.0 | 80.0 | 160.0 | 1400.0 | 1800 | Intel | Intel | 2135.0 | 364.0 | Low |
| 4 | AMD Phenom II X4 980 BE | CPU | 2011 | 45.0 | 125.0 | 258.0 | 758.0 | 3700 | Unknown | AMD | 2135.0 | 364.0 | High |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3927 | ATI FirePro V7800 | GPU | 2010 | 40.0 | 150.0 | 334.0 | 2154.0 | 700 | TSMC | ATI | 2016.0 | 403.0 | Medium |
| 3928 | NVIDIA Playstation 3 GPU 28nm | GPU | 2013 | 28.0 | 21.0 | 68.0 | 302.0 | 550 | Sony | NVIDIA | 2135.0 | 364.0 | Medium |
| 3929 | NVIDIA Quadro 3000M | GPU | 2011 | 40.0 | 75.0 | 332.0 | 1950.0 | 450 | TSMC | NVIDIA | 432.0 | 36.0 | Low |
| 3930 | Intel GMA 950 | GPU | 2005 | 90.0 | 7.0 | 188.0 | 1930.0 | 250 | Intel | Intel | 2135.0 | 364.0 | Low |
| 3931 | NVIDIA GeForce GT 320M | GPU | 2010 | 40.0 | 23.0 | 100.0 | 486.0 | 500 | TSMC | NVIDIA | 53.0 | 364.0 | Medium |
3932 rows × 13 columns
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Release Date | 3932.0 | 2010.478891 | 5.067623 | 2001.0 | 2006.0 | 2011.0 | 2014.0 | 2021.0 |
| Process Size (nm) | 3932.0 | 54.936165 | 39.522290 | 8.0 | 28.0 | 40.0 | 90.0 | 150.0 |
| TDP (W) | 3932.0 | 70.707782 | 47.181074 | 5.0 | 35.0 | 65.0 | 86.0 | 294.0 |
| Die Size (mm^2) | 3932.0 | 178.016277 | 84.741565 | 57.0 | 118.0 | 177.0 | 215.0 | 529.0 |
| Transistors (million) | 3932.0 | 1308.163021 | 1569.351396 | 36.0 | 181.0 | 754.0 | 1930.0 | 10800.0 |
| Freq (MHz) | 3932.0 | 1538.046033 | 1049.048905 | 220.0 | 600.0 | 1124.5 | 2400.0 | 3800.0 |
| FP32 GFLOPS | 3932.0 | 1727.756104 | 905.849215 | 30.0 | 1009.0 | 2135.0 | 2135.0 | 5990.0 |
| FP64 GFLOPS | 3932.0 | 303.717701 | 119.798022 | 13.0 | 364.0 | 364.0 | 364.0 | 529.0 |
print('The current companies list:')
display(df['Vendor'].unique().tolist()) # there are some other companies that are not shown in vendor column
Company = []
for i in df['Product']:
    tokens = i.split()
    Company.append(tokens[0])  # the first word of the product name is the company
df['Company'] = Company
print('The new companies list:')
display(df['Company'].unique().tolist()) # This shows all the companies
# take out unwanted columns
df.drop(labels=['Product','Vendor'], axis= 1,inplace= True)
display(df)
The current companies list:
['AMD', 'Intel', 'NVIDIA', 'ATI', 'Other']
The new companies list:
['AMD', 'Intel', 'NVIDIA', 'ATI', 'XGI', 'Matrox']
| | Type | Release Date | Process Size (nm) | TDP (W) | Die Size (mm^2) | Transistors (million) | Freq (MHz) | Foundry | FP32 GFLOPS | FP64 GFLOPS | Performance | Company |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CPU | 2007 | 65.0 | 45.0 | 77.0 | 122.0 | 2200 | Unknown | 2135.0 | 364.0 | Medium | AMD |
| 1 | CPU | 2018 | 14.0 | 35.0 | 192.0 | 4800.0 | 3200 | Unknown | 2135.0 | 364.0 | High | AMD |
| 2 | CPU | 2020 | 10.0 | 28.0 | 188.0 | 1930.0 | 2600 | Intel | 2135.0 | 364.0 | Medium | Intel |
| 3 | CPU | 2013 | 22.0 | 80.0 | 160.0 | 1400.0 | 1800 | Intel | 2135.0 | 364.0 | Low | Intel |
| 4 | CPU | 2011 | 45.0 | 125.0 | 258.0 | 758.0 | 3700 | Unknown | 2135.0 | 364.0 | High | AMD |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3927 | GPU | 2010 | 40.0 | 150.0 | 334.0 | 2154.0 | 700 | TSMC | 2016.0 | 403.0 | Medium | ATI |
| 3928 | GPU | 2013 | 28.0 | 21.0 | 68.0 | 302.0 | 550 | Sony | 2135.0 | 364.0 | Medium | NVIDIA |
| 3929 | GPU | 2011 | 40.0 | 75.0 | 332.0 | 1950.0 | 450 | TSMC | 432.0 | 36.0 | Low | NVIDIA |
| 3930 | GPU | 2005 | 90.0 | 7.0 | 188.0 | 1930.0 | 250 | Intel | 2135.0 | 364.0 | Low | Intel |
| 3931 | GPU | 2010 | 40.0 | 23.0 | 100.0 | 486.0 | 500 | TSMC | 53.0 | 364.0 | Medium | NVIDIA |
3932 rows × 12 columns
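The first-token loop above can also be written with pandas string accessors; a vectorized sketch:

```python
import pandas as pd

products = pd.Series(["AMD Athlon 64 3500+", "Intel Core i5-1145G7", "NVIDIA Quadro 3000M"])
# Split each name on whitespace and keep the first token
company = products.str.split().str[0]
print(company.tolist())  # ['AMD', 'Intel', 'NVIDIA']
```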
# Encoding to convert categorical data into numerical
newDf = df.copy()  # work on a copy so df is left untouched
# Since these values are ordered, we'll use custom encoding
performanceEncoder = {"Low":0, "Medium": 1,'High':2}
typeEncoder = {"GPU":0 , "CPU": 1 }
newDf['Performance'] = newDf['Performance'].map(performanceEncoder)
newDf['Type'] = newDf['Type'].map(typeEncoder)
# Since there is no order in these columns we'll use one-hot-encoding
newDf = pd.get_dummies(newDf, columns=['Foundry'],drop_first=True)
newDf = pd.get_dummies(newDf, columns=['Company'],drop_first=True)
display(newDf)
# display(newDf.info())
| | Type | Release Date | Process Size (nm) | TDP (W) | Die Size (mm^2) | Transistors (million) | Freq (MHz) | FP32 GFLOPS | FP64 GFLOPS | Performance | ... | Foundry_Samsung | Foundry_Sony | Foundry_TSMC | Foundry_UMC | Foundry_Unknown | Company_ATI | Company_Intel | Company_Matrox | Company_NVIDIA | Company_XGI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2007 | 65.0 | 45.0 | 77.0 | 122.0 | 2200 | 2135.0 | 364.0 | 1 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 2018 | 14.0 | 35.0 | 192.0 | 4800.0 | 3200 | 2135.0 | 364.0 | 2 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 2020 | 10.0 | 28.0 | 188.0 | 1930.0 | 2600 | 2135.0 | 364.0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 1 | 2013 | 22.0 | 80.0 | 160.0 | 1400.0 | 1800 | 2135.0 | 364.0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | 1 | 2011 | 45.0 | 125.0 | 258.0 | 758.0 | 3700 | 2135.0 | 364.0 | 2 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3927 | 0 | 2010 | 40.0 | 150.0 | 334.0 | 2154.0 | 700 | 2016.0 | 403.0 | 1 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3928 | 0 | 2013 | 28.0 | 21.0 | 68.0 | 302.0 | 550 | 2135.0 | 364.0 | 1 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3929 | 0 | 2011 | 40.0 | 75.0 | 332.0 | 1950.0 | 450 | 432.0 | 36.0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3930 | 0 | 2005 | 90.0 | 7.0 | 188.0 | 1930.0 | 250 | 2135.0 | 364.0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3931 | 0 | 2010 | 40.0 | 23.0 | 100.0 | 486.0 | 500 | 53.0 | 364.0 | 1 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
3932 rows × 23 columns
# Generate Train - Test splits
from sklearn.model_selection import train_test_split
X = newDf.drop(labels='Performance',axis=1,inplace=False).values # X: for input values
y = newDf['Performance'].values # y: for the output value (performance column)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=22)
# Scaling the Train - Test splits
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(np.c_[X_train, y_train])  # fit on the training data only, never on the test data
# assigning the variables
A_train = scaler.transform(np.c_[X_train,y_train])
X_train = A_train[:,:-1]
y_train = A_train[:,-1]
A_test = scaler.transform(np.c_[X_test,y_test])
X_test = A_test[:,:-1]
y_test = A_test[:,-1]
print(A_train)
[[-0.97063446  0.30128559 -0.37391772 ... -0.54151    -0.05049843 -0.04121826]
 [ 1.03025396 -1.08689082  0.26375274 ... -0.54151    -0.05049843 -1.26092413]
 [-0.97063446  1.88777291 -1.08810864 ...  1.84668796 -0.05049843  1.17848761]
 ...
 [ 1.03025396  0.4995965  -0.83304046 ... -0.54151    -0.05049843 -1.26092413]
 [-0.97063446  0.4995965  -0.67999955 ... -0.54151    -0.05049843  1.17848761]
 [-0.97063446 -1.28520174  1.92169594 ...  1.84668796 -0.05049843 -1.26092413]]
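Concatenating X and y with np.c_ works, but a common alternative keeps separate scalers for the features and the target, which makes inverse-transforming predictions easier later; a sketch on synthetic data (the synthetic X_train/y_train stand in for the real splits):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # stand-in features
y_train = rng.normal(loc=1.0, scale=0.5, size=100)       # stand-in target

x_scaler = StandardScaler().fit(X_train)                 # fit on training data only
y_scaler = StandardScaler().fit(y_train.reshape(-1, 1))
X_scaled = x_scaler.transform(X_train)
y_scaled = y_scaler.transform(y_train.reshape(-1, 1)).ravel()
print(X_scaled.mean(axis=0), y_scaled.std())  # means near 0, std near 1
```

With a separate y_scaler, predictions in the scaled space can be mapped back via `y_scaler.inverse_transform`.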
# Regression Analysis: Mean Squared Error Metric
from sklearn.metrics import mean_squared_error
## OLS
from sklearn.linear_model import LinearRegression
reg1 = LinearRegression(fit_intercept=False).fit(X_train, y_train)
y_pred1 = reg1.predict(X_test)
print('The MSE using OLS is:', mean_squared_error(y_test, y_pred1))
## Ridge
from sklearn.linear_model import RidgeCV
reg2 = RidgeCV(alphas=[1e-3, 1e-2, 1e-1, 1e0, 1e1, 1e2, 1e3], fit_intercept=False, cv=10).fit(X_train, y_train)  # alpha candidates can be given explicitly, or omitted to let RidgeCV choose
y_pred2 = reg2.predict(X_test)
print('The MSE using Ridge is:', mean_squared_error(y_test, y_pred2))
## Lasso
from sklearn.linear_model import LassoCV
reg3 = LassoCV(alphas=[1e-3, 1e-2, 1e-1, 1e0, 1e1, 1e2, 1e3], fit_intercept=False, cv=10, random_state=0).fit(X_train, y_train)  # fixed random_state for reproducible results
y_pred3 = reg3.predict(X_test)
print('The MSE using Lasso is:', mean_squared_error(y_test, y_pred3))
# Inspecting the Ridge (reg2) coefficient estimates
best_beta = np.round(reg2.coef_, 2)
best_beta_0 = np.round(reg2.intercept_, 2)
print('The best values for the estimates are :', best_beta_0, best_beta.tolist())
The MSE using OLS is: 0.21171304984851613
The MSE using Ridge is: 0.21156615106334847
The MSE using Lasso is: 0.21137530104064628
The best values for the estimates are : 0.0 [-1.06, -0.14, -0.28, 0.12, -0.05, -0.04, 1.46, 0.03, -0.19, -0.04, -0.0, 0.0, 0.03, 0.02, 0.23, 0.03, -0.03, -0.07, -0.04, -0.01, -0.11, -0.02]
print('The best penalty coefficient is:', reg3.alpha_)
print('\nThe best coefficient estimates are:')
display(pd.DataFrame(reg3.coef_).T)
The best penalty coefficient is: 0.001

The best coefficient estimates are:
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.119166 | -0.133455 | -0.26889 | 0.112761 | -0.048367 | -0.031138 | 1.505112 | 0.023111 | -0.181232 | -0.063507 | ... | 0.029195 | 0.014654 | 0.231653 | 0.027514 | -0.011208 | -0.068274 | -0.0 | -0.00684 | -0.107215 | -0.020611 |
1 rows × 22 columns
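The coefficient table above is unlabeled; pairing `coef_` with the feature names makes it much easier to read. A sketch of the pattern on a toy fit (the toy columns stand in for newDf's features):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
# Toy design: only "Freq (MHz)" truly drives the target
X = pd.DataFrame(rng.normal(size=(200, 3)),
                 columns=["Freq (MHz)", "TDP (W)", "Die Size (mm^2)"])
y = 2.0 * X["Freq (MHz)"] + rng.normal(scale=0.1, size=200)

model = LassoCV(cv=5, random_state=0).fit(X.values, y)
# Attach the column names to the fitted coefficients
coefs = pd.Series(model.coef_, index=X.columns).sort_values()
print(coefs)  # "Freq (MHz)" gets a coefficient near 2; the others shrink toward 0
```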
# First split the data into train and test sets to prepare it for fitting
# Generate Train - Test splits
from sklearn.model_selection import train_test_split # the same process as in the Regression
X = newDf.drop('Performance', axis=1).values # we'll use the same data as in the Regression
y = newDf['Performance'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
display(y_test)
display(np.unique(y_test, return_counts=True))
# Scaling the Train - Test splits
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
array([1, 2, 2, ..., 2, 2, 0], dtype=int64)
(array([0, 1, 2], dtype=int64), array([365, 401, 414], dtype=int64))
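The class counts above are close but not equal; train_test_split can preserve the class proportions exactly via its stratify parameter (a sketch on synthetic labels):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 50 of class 0, 30 of class 1, 20 of class 2
y = np.array([0] * 50 + [1] * 30 + [2] * 20)
X = np.arange(100).reshape(-1, 1)
# stratify=y keeps the 50/30/20 ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=42, stratify=y)
print(np.unique(y_te, return_counts=True))  # counts 15, 9, 6 in the test set
```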
# Import the wanted library
from sklearn import tree
dtClf = tree.DecisionTreeClassifier(random_state=0,criterion='entropy',splitter='best') # Splitting by the best IG value
dtClf = dtClf.fit(X_train,y_train)
import matplotlib.pyplot as plt
plt.figure(figsize =(10,10),dpi=1000)
tree.plot_tree(dtClf,
feature_names=newDf.drop('Performance', axis=1).columns,
class_names=['Low','Medium','High'], # provide the output unique values (you can change the names but not the order)
filled=True,
rounded=True);
plt.show()
# Showing the tree classifier in a text form
print(" Class 0 = Low\n","Class 1 = Medium\n","Class 2 = High\n")
print(tree.export_text(dtClf,
feature_names=newDf.drop('Performance', axis=1).columns.tolist()))
 Class 0 = Low
 Class 1 = Medium
 Class 2 = High

|--- Freq (MHz) <= 1.20
|   |--- Freq (MHz) <= -0.98
|   |   |--- class: 0
|   |--- Freq (MHz) > -0.98
|   |   |--- Type <= 0.04
|   |   |   |--- Freq (MHz) <= -0.76
|   |   |   |   |--- class: 1
|   |   |   |--- Freq (MHz) > -0.76
|   |   |   |   |--- class: 2
|   |   |--- Type > 0.04
|   |   |   |--- Freq (MHz) <= 0.61
|   |   |   |   |--- class: 0
|   |   |   |--- Freq (MHz) > 0.61
|   |   |   |   |--- class: 1
|--- Freq (MHz) > 1.20
|   |--- class: 2
# To calculate the accuracy and confusion matrix of the Decision Tree
dt_y_pred = dtClf.predict(X_test)
from sklearn.metrics import accuracy_score, confusion_matrix
print("Decision Tree: \n")
print("Accuracy:=", accuracy_score(y_test, dt_y_pred))
print("Confusion Matrix:= \n", confusion_matrix(y_test, dt_y_pred) )
Decision Tree: 

Accuracy:= 1.0
Confusion Matrix:= 
 [[365   0   0]
 [  0 401   0]
 [  0   0 414]]
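Accuracy alone can hide per-class behaviour; classification_report adds per-class precision and recall alongside the confusion matrix. A sketch on toy predictions (the real call would pass y_test and dt_y_pred):

```python
from sklearn.metrics import classification_report, confusion_matrix

# Toy labels: 0=Low, 1=Medium, 2=High, mirroring the encoding used above
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
cm = confusion_matrix(y_true, y_pred)
print(cm)
print(classification_report(y_true, y_pred, target_names=["Low", "Medium", "High"]))
```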
# We'll use GaussianNB for naive Bayes since our features are continuous rather than categorical
from sklearn.naive_bayes import GaussianNB
NBClf = GaussianNB()
NBClf.fit(X_train,y_train)
from sklearn.metrics import accuracy_score, confusion_matrix
NB_y_pred = NBClf.predict(X_test)
print("Accuracy:=", accuracy_score(y_test, NB_y_pred))
print("Confusion Matrix:= \n", confusion_matrix(y_test, NB_y_pred) )
Accuracy:= 0.36271186440677966
Confusion Matrix:= 
 [[ 13   2 350]
 [  5   1 395]
 [  0   0 414]]
y = newDf["Performance"].values
X = newDf.drop("Performance", axis=1)
X_train, X_test, y_train, y_test,ind_train,ind_test = train_test_split(X, y,newDf.index, test_size=0.3, random_state=10)
X_train_org = X_train  # keep the unscaled version for visualizing in the graph
# Scaling the Train - Test splits
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
import scipy.cluster.hierarchy as shc
plt.figure(figsize=(15, 6))
plt.title("Data Visualising")
first_dendrogram = shc.dendrogram(shc.linkage(newDf, method='complete', metric='euclidean'))
plt.xticks([])  # hide the leaf labels, which are unreadable at this scale
plt.ylabel("Height")
plt.show()
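The dendrogram only visualizes the hierarchy; scipy's fcluster can cut the same linkage into flat cluster labels for a chosen number of clusters. A sketch on two toy blobs:

```python
import numpy as np
import scipy.cluster.hierarchy as shc

rng = np.random.default_rng(0)
# Two well-separated toy blobs of five points each
pts = np.vstack([rng.normal(0.0, 0.1, size=(5, 2)),
                 rng.normal(5.0, 0.1, size=(5, 2))])
Z = shc.linkage(pts, method='complete', metric='euclidean')
labels = shc.fcluster(Z, t=2, criterion='maxclust')  # cut into 2 flat clusters
print(labels)  # each blob gets one cluster label
```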
# K-means clustering
from sklearn.cluster import KMeans
from sklearn import metrics
values_of_k = range(2, 20)  # candidate values for k
# arrays to hold the internal measures (silhouette and Davies-Bouldin)
silhouette_scores = np.empty([len(values_of_k), 1])
davies_bouldin_scores = np.empty_like(silhouette_scores)
# arrays to hold the external measures (ARI and NMI)
ari_scores = np.empty_like(silhouette_scores)
nmi_scores = np.empty_like(silhouette_scores)
for i, n in enumerate(values_of_k):
    kmeans = KMeans(n_clusters=n, max_iter=3000, n_init=10, random_state=0).fit(X_train)
    silhouette_scores[i] = metrics.silhouette_score(X_train, kmeans.labels_)
    davies_bouldin_scores[i] = metrics.davies_bouldin_score(X_train, kmeans.labels_)
    ari_scores[i] = metrics.cluster.adjusted_rand_score(y_train, kmeans.labels_)
    nmi_scores[i] = metrics.cluster.normalized_mutual_info_score(y_train, kmeans.labels_)
plt.plot(values_of_k, silhouette_scores, 'o:', c='r')
plt.plot(values_of_k, davies_bouldin_scores, 's:', c='b')
plt.plot(values_of_k, ari_scores, '^:', c='g')
plt.plot(values_of_k, nmi_scores, 'd:', c='m')
plt.xlabel("k")
plt.ylabel("Cluster quality measure")
plt.legend(['SI', 'DBI', 'ARI', 'NMI'])
plt.show()
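A common complement to these four measures is the elbow heuristic on the k-means inertia (within-cluster sum of squares). A small sketch on synthetic blobs, not the chip data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 obvious blobs; inertia drops sharply until k reaches the true count.
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(2, 7)]
print(inertias)  # non-increasing; the "elbow" sits near k = 3 for this data
```

Inertia always shrinks as k grows, so the point where the decrease flattens out, rather than the minimum, is what suggests a good k.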
# Hierarchical (agglomerative) cluster analysis: sweep k the same way
from sklearn.cluster import AgglomerativeClustering

values_of_k = range(2, 20)  # candidate values for k
# internal measures
sil_scores = np.empty([len(values_of_k), 1])
dbi_scores = np.empty_like(sil_scores)
# external measures
ari_scores = np.empty_like(sil_scores)
nmi_scores = np.empty_like(sil_scores)
for i, n in enumerate(values_of_k):
    hclus = AgglomerativeClustering(n_clusters=n, linkage="complete").fit(X_train)
    sil_scores[i] = metrics.silhouette_score(X_train, hclus.labels_)
    dbi_scores[i] = metrics.davies_bouldin_score(X_train, hclus.labels_)
    ari_scores[i] = metrics.cluster.adjusted_rand_score(y_train, hclus.labels_)
    nmi_scores[i] = metrics.cluster.normalized_mutual_info_score(y_train, hclus.labels_)
plt.plot(values_of_k, sil_scores, 'o:', c='r')
plt.plot(values_of_k, dbi_scores, 's:', c='b')
plt.plot(values_of_k, ari_scores, '^:', c='g')
plt.plot(values_of_k, nmi_scores, 'd:', c='m')
plt.xlabel("k")
plt.ylabel("Cluster validity score")
plt.legend(['SI', 'DBI', 'ARI', 'NMI'])
plt.show()
# evaluate the chosen value k = 13 on the held-out test split
kmeans = KMeans(n_clusters=13, max_iter=3000, n_init=10, random_state=0, algorithm="elkan").fit(X_train)
y_pred = kmeans.predict(X_test)
sil = metrics.silhouette_score(X_test, y_pred)
dbi = metrics.davies_bouldin_score(X_test, y_pred)
ari = metrics.cluster.adjusted_rand_score(y_test, y_pred)
nmi = metrics.cluster.normalized_mutual_info_score(y_test, y_pred)
print(f"The SI is {sil}, DBI score is {dbi}, ARI score is {ari}, and NMI score is {nmi}.")
The SI is 0.322994277772063, DBI score is 0.8766464844108177, ARI score is 0.04423287019120203, and NMI score is 0.0944062622160226.
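The drop from the training-set scores to these test-set scores is expected: `predict` assigns each test point to its nearest trained centroid, and the validity measures are then recomputed on the test split alone. A toy sketch of that fit-on-train, score-on-test pattern (synthetic blobs, not the chip data):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two well-separated synthetic blobs, split into "train" and "test" portions.
rng = np.random.default_rng(0)
X_tr = np.vstack([rng.normal(0, 0.3, (40, 2)), rng.normal(5, 0.3, (40, 2))])
X_te = np.vstack([rng.normal(0, 0.3, (10, 2)), rng.normal(5, 0.3, (10, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_tr)
te_labels = km.predict(X_te)  # each test point goes to its nearest trained centroid
score = silhouette_score(X_te, te_labels)
print(score)  # close to 1 for blobs this well separated
```

On clean, well-separated data the test silhouette stays high; on the chip data the much lower score suggests the 13 clusters overlap considerably in feature space.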
# rebuild a labelled training frame and attach the k-means cluster assignments
df2 = pd.DataFrame(np.c_[X_train_org, y_train], columns=newDf.columns)
df2['cluster'] = kmeans.labels_
df2['Performance'] = newDf.loc[ind_train, 'Performance'].values
selected_columns = df2.columns.drop(['cluster', 'Performance'])
fig, axes = plt.subplots(8, 3, figsize=(30, 35))
for ind, col in enumerate(selected_columns):
    sns.violinplot(y=col, x='cluster', data=df2, ax=axes.flatten()[ind])
plt.show()
Here, with the chosen number of clusters (k = 13), we obtained a maximum SI score of 0.43 and a minimum DBI of 0.74.
As the violin plots clearly show, GPUs, vendors, and foundries are easy to identify from the cluster profiles.
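One way to back the violin plots up numerically is a per-cluster mean profile via `groupby`. A minimal sketch on a toy stand-in for `df2` (the column names below merely mimic the real ones):

```python
import pandas as pd

# Toy stand-in for df2: two numeric features plus a cluster label.
toy = pd.DataFrame({
    'Freq (MHz)': [500, 520, 3000, 3100],
    'TDP (W)':    [30, 35, 200, 210],
    'cluster':    [0, 0, 1, 1],
})

# Per-cluster means summarise numerically what each violin plot shows visually.
profile = toy.groupby('cluster').mean()
print(profile)
```

Applied to the real `df2`, `df2.groupby('cluster').mean()` would give one summary row per cluster for every feature at once.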
clusters_test = [1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 12]
for c in clusters_test:
    rows_test = df2['cluster'] == c
    print(f"\ncluster {c}:")
    display(df2.loc[rows_test].head())
    print("-" * 120)
cluster 1:
| Type | Release Date | Process Size (nm) | TDP (W) | Die Size (mm^2) | Transistors (million) | Freq (MHz) | FP32 GFLOPS | FP64 GFLOPS | Performance | ... | Foundry_Sony | Foundry_TSMC | Foundry_UMC | Foundry_Unknown | Company_ATI | Company_Intel | Company_Matrox | Company_NVIDIA | Company_XGI | cluster | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 2011.0 | 40.0 | 75.0 | 212.0 | 1700.0 | 680.0 | 1306.0 | 364.0 | 1 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1 |
| 7 | 0.0 | 2016.0 | 28.0 | 81.0 | 125.0 | 1550.0 | 780.0 | 589.0 | 37.0 | 2 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 1 |
| 19 | 0.0 | 2020.0 | 14.0 | 15.0 | 210.0 | 4940.0 | 300.0 | 461.0 | 29.0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
| 26 | 0.0 | 2010.0 | 40.0 | 18.0 | 75.0 | 450.0 | 488.0 | 78.0 | 364.0 | 0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
| 28 | 0.0 | 2013.0 | 32.0 | 65.0 | 246.0 | 1303.0 | 760.0 | 195.0 | 364.0 | 2 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 1 |
5 rows × 24 columns
------------------------------------------------------------------------------------------------------------------------ cluster 2:
| Type | Release Date | Process Size (nm) | TDP (W) | Die Size (mm^2) | Transistors (million) | Freq (MHz) | FP32 GFLOPS | FP64 GFLOPS | Performance | ... | Foundry_Sony | Foundry_TSMC | Foundry_UMC | Foundry_Unknown | Company_ATI | Company_Intel | Company_Matrox | Company_NVIDIA | Company_XGI | cluster | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | 1.0 | 2001.0 | 130.0 | 62.0 | 101.0 | 63.0 | 1500.0 | 2135.0 | 364.0 | 0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2 |
| 5 | 1.0 | 2001.0 | 90.0 | 51.0 | 84.0 | 69.0 | 2000.0 | 2135.0 | 364.0 | 0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2 |
| 12 | 1.0 | 2012.0 | 32.0 | 65.0 | 246.0 | 1178.0 | 3400.0 | 2135.0 | 364.0 | 2 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 2 |
| 16 | 1.0 | 2004.0 | 130.0 | 82.0 | 193.0 | 106.0 | 1600.0 | 2135.0 | 364.0 | 0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2 |
| 21 | 1.0 | 2003.0 | 130.0 | 82.0 | 193.0 | 106.0 | 1600.0 | 2135.0 | 364.0 | 0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2 |
5 rows × 24 columns
------------------------------------------------------------------------------------------------------------------------ cluster 3:
| Type | Release Date | Process Size (nm) | TDP (W) | Die Size (mm^2) | Transistors (million) | Freq (MHz) | FP32 GFLOPS | FP64 GFLOPS | Performance | ... | Foundry_Sony | Foundry_TSMC | Foundry_UMC | Foundry_Unknown | Company_ATI | Company_Intel | Company_Matrox | Company_NVIDIA | Company_XGI | cluster | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9 | 0.0 | 2001.0 | 150.0 | 23.0 | 83.0 | 60.0 | 290.0 | 2135.0 | 364.0 | 0 | ... | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3 |
| 18 | 0.0 | 2013.0 | 130.0 | 81.0 | 76.0 | 60.0 | 325.0 | 2135.0 | 364.0 | 0 | ... | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3 |
| 25 | 0.0 | 2007.0 | 65.0 | 81.0 | 85.0 | 180.0 | 450.0 | 36.0 | 364.0 | 0 | ... | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3 |
| 27 | 0.0 | 2005.0 | 90.0 | 31.0 | 100.0 | 107.0 | 600.0 | 2135.0 | 364.0 | 1 | ... | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 3 |
| 29 | 0.0 | 2009.0 | 55.0 | 81.0 | 73.0 | 242.0 | 500.0 | 80.0 | 364.0 | 1 | ... | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 3 |
5 rows × 24 columns
------------------------------------------------------------------------------------------------------------------------ cluster 5:
| Type | Release Date | Process Size (nm) | TDP (W) | Die Size (mm^2) | Transistors (million) | Freq (MHz) | FP32 GFLOPS | FP64 GFLOPS | Performance | ... | Foundry_Sony | Foundry_TSMC | Foundry_UMC | Foundry_Unknown | Company_ATI | Company_Intel | Company_Matrox | Company_NVIDIA | Company_XGI | cluster | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 0.0 | 2006.0 | 90.0 | 45.0 | 196.0 | 278.0 | 575.0 | 2135.0 | 364.0 | 1 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 5 |
| 17 | 0.0 | 2008.0 | 65.0 | 78.0 | 324.0 | 754.0 | 500.0 | 280.0 | 364.0 | 1 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 5 |
| 31 | 0.0 | 2004.0 | 130.0 | 81.0 | 133.0 | 82.0 | 425.0 | 2135.0 | 364.0 | 0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 5 |
| 38 | 0.0 | 2008.0 | 65.0 | 50.0 | 144.0 | 314.0 | 600.0 | 96.0 | 364.0 | 1 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 5 |
| 51 | 0.0 | 2010.0 | 40.0 | 25.0 | 100.0 | 486.0 | 475.0 | 53.0 | 364.0 | 0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 5 |
5 rows × 24 columns
------------------------------------------------------------------------------------------------------------------------ cluster 6:
| Type | Release Date | Process Size (nm) | TDP (W) | Die Size (mm^2) | Transistors (million) | Freq (MHz) | FP32 GFLOPS | FP64 GFLOPS | Performance | ... | Foundry_Sony | Foundry_TSMC | Foundry_UMC | Foundry_Unknown | Company_ATI | Company_Intel | Company_Matrox | Company_NVIDIA | Company_XGI | cluster | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 990 | 0.0 | 2013.0 | 90.0 | 81.0 | 188.0 | 1930.0 | 250.0 | 2135.0 | 364.0 | 0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 6 |
| 1621 | 0.0 | 2006.0 | 150.0 | 81.0 | 174.0 | 80.0 | 250.0 | 2135.0 | 364.0 | 0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 6 |
| 1781 | 0.0 | 2004.0 | 150.0 | 81.0 | 174.0 | 80.0 | 250.0 | 2135.0 | 364.0 | 0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 6 |
| 2324 | 0.0 | 2002.0 | 150.0 | 81.0 | 174.0 | 80.0 | 220.0 | 2135.0 | 364.0 | 0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 6 |
4 rows × 24 columns
------------------------------------------------------------------------------------------------------------------------ cluster 7:
| Type | Release Date | Process Size (nm) | TDP (W) | Die Size (mm^2) | Transistors (million) | Freq (MHz) | FP32 GFLOPS | FP64 GFLOPS | Performance | ... | Foundry_Sony | Foundry_TSMC | Foundry_UMC | Foundry_Unknown | Company_ATI | Company_Intel | Company_Matrox | Company_NVIDIA | Company_XGI | cluster | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1424 | 0.0 | 2012.0 | 40.0 | 33.0 | 146.0 | 880.0 | 550.0 | 176.0 | 364.0 | 1 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 7 |
1 rows × 24 columns
------------------------------------------------------------------------------------------------------------------------ cluster 8:
| Type | Release Date | Process Size (nm) | TDP (W) | Die Size (mm^2) | Transistors (million) | Freq (MHz) | FP32 GFLOPS | FP64 GFLOPS | Performance | ... | Foundry_Sony | Foundry_TSMC | Foundry_UMC | Foundry_Unknown | Company_ATI | Company_Intel | Company_Matrox | Company_NVIDIA | Company_XGI | cluster | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 199 | 0.0 | 2021.0 | 8.0 | 81.0 | 188.0 | 1930.0 | 712.0 | 4329.0 | 68.0 | 1 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 8 |
| 333 | 0.0 | 2016.0 | 14.0 | 75.0 | 132.0 | 3300.0 | 1354.0 | 1862.0 | 58.0 | 2 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 | 8 |
| 633 | 0.0 | 2017.0 | 14.0 | 30.0 | 74.0 | 1800.0 | 1228.0 | 1127.0 | 35.0 | 2 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 | 8 |
| 728 | 0.0 | 2017.0 | 14.0 | 75.0 | 132.0 | 3300.0 | 1354.0 | 1911.0 | 60.0 | 2 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 | 8 |
| 851 | 0.0 | 2019.0 | 14.0 | 18.0 | 74.0 | 1800.0 | 1303.0 | 1147.0 | 36.0 | 2 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 | 8 |
5 rows × 24 columns
------------------------------------------------------------------------------------------------------------------------ cluster 9:
| Type | Release Date | Process Size (nm) | TDP (W) | Die Size (mm^2) | Transistors (million) | Freq (MHz) | FP32 GFLOPS | FP64 GFLOPS | Performance | ... | Foundry_Sony | Foundry_TSMC | Foundry_UMC | Foundry_Unknown | Company_ATI | Company_Intel | Company_Matrox | Company_NVIDIA | Company_XGI | cluster | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 118 | 0.0 | 2003.0 | 130.0 | 81.0 | 188.0 | 110.0 | 350.0 | 2135.0 | 364.0 | 0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 9 |
| 534 | 0.0 | 2003.0 | 130.0 | 81.0 | 188.0 | 90.0 | 325.0 | 2135.0 | 364.0 | 0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 9 |
| 761 | 0.0 | 2003.0 | 130.0 | 81.0 | 188.0 | 90.0 | 350.0 | 2135.0 | 364.0 | 0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 9 |
| 1544 | 0.0 | 2003.0 | 130.0 | 81.0 | 188.0 | 110.0 | 350.0 | 2135.0 | 364.0 | 0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 9 |
| 1627 | 0.0 | 2003.0 | 130.0 | 81.0 | 188.0 | 90.0 | 350.0 | 2135.0 | 364.0 | 0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 9 |
5 rows × 24 columns
------------------------------------------------------------------------------------------------------------------------ cluster 10:
| Type | Release Date | Process Size (nm) | TDP (W) | Die Size (mm^2) | Transistors (million) | Freq (MHz) | FP32 GFLOPS | FP64 GFLOPS | Performance | ... | Foundry_Sony | Foundry_TSMC | Foundry_UMC | Foundry_Unknown | Company_ATI | Company_Intel | Company_Matrox | Company_NVIDIA | Company_XGI | cluster | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 868 | 0.0 | 2012.0 | 40.0 | 35.0 | 114.0 | 302.0 | 550.0 | 2135.0 | 364.0 | 1 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 10 |
| 1438 | 0.0 | 2013.0 | 28.0 | 21.0 | 68.0 | 302.0 | 550.0 | 2135.0 | 364.0 | 1 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 10 |
| 1770 | 0.0 | 2013.0 | 65.0 | 58.0 | 186.0 | 300.0 | 550.0 | 2135.0 | 364.0 | 1 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 10 |
3 rows × 24 columns
------------------------------------------------------------------------------------------------------------------------ cluster 11:
| Type | Release Date | Process Size (nm) | TDP (W) | Die Size (mm^2) | Transistors (million) | Freq (MHz) | FP32 GFLOPS | FP64 GFLOPS | Performance | ... | Foundry_Sony | Foundry_TSMC | Foundry_UMC | Foundry_Unknown | Company_ATI | Company_Intel | Company_Matrox | Company_NVIDIA | Company_XGI | cluster | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 660 | 0.0 | 2006.0 | 90.0 | 45.0 | 95.0 | 107.0 | 243.0 | 2135.0 | 364.0 | 0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 11 |
1 rows × 24 columns
------------------------------------------------------------------------------------------------------------------------ cluster 12:
| Type | Release Date | Process Size (nm) | TDP (W) | Die Size (mm^2) | Transistors (million) | Freq (MHz) | FP32 GFLOPS | FP64 GFLOPS | Performance | ... | Foundry_Sony | Foundry_TSMC | Foundry_UMC | Foundry_Unknown | Company_ATI | Company_Intel | Company_Matrox | Company_NVIDIA | Company_XGI | cluster | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.0 | 2017.0 | 16.0 | 75.0 | 314.0 | 7200.0 | 1088.0 | 3110.0 | 97.0 | 2 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 | 12 |
| 13 | 0.0 | 2011.0 | 40.0 | 210.0 | 520.0 | 3000.0 | 732.0 | 1312.0 | 164.0 | 2 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 | 12 |
| 20 | 0.0 | 2015.0 | 28.0 | 225.0 | 398.0 | 5200.0 | 557.0 | 4825.0 | 151.0 | 1 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 12 |
| 22 | 0.0 | 2009.0 | 55.0 | 188.0 | 470.0 | 1400.0 | 610.0 | 622.0 | 78.0 | 1 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 12 |
| 36 | 0.0 | 2014.0 | 28.0 | 150.0 | 366.0 | 5000.0 | 920.0 | 3297.0 | 206.0 | 2 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 12 |
5 rows × 24 columns
------------------------------------------------------------------------------------------------------------------------
# repeat the cluster profiling with hierarchical clustering (19 clusters)
hclus = AgglomerativeClustering(n_clusters=19).fit(X_train)
df3 = df2.drop("cluster", axis=1)
df3 = pd.concat([df3, pd.DataFrame({'cluster': hclus.labels_})], axis=1)
for c in df3.drop(columns=['cluster']):
    grid = sns.FacetGrid(df3, col='cluster')
    grid.map(plt.hist, c)
RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`.)
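This RuntimeWarning comes from creating one figure per column without ever closing them. A minimal sketch of the fix, closing each figure once it has been rendered (the `Agg` backend is used here only to keep the sketch headless):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, just for this illustration
import matplotlib.pyplot as plt

# Open many figures, but release each one immediately after use;
# the >20-open-figures warning then never fires.
for _ in range(25):
    fig = plt.figure()
    plt.close(fig)  # frees the figure's memory

print(len(plt.get_fignums()))  # 0: nothing left open
```

In the FacetGrid loop above, a `plt.close(grid.fig)` after each `grid.map(...)` call would have the same effect.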